Text File | 1995-01-10 | 24KB | 419 lines

PC-NORMIX (Version 2.1)
Copyright (C) John H. Wolfe, 1995.  All rights reserved.
.......................................................................

A. Identification

1. Title= Cluster and pattern analysis of normal mixtures

2. Identification= G7 PC-NORMIX (Version 2.1)

3. Category= Multivariate Statistics

4. Compiler/Operating System= The source code has been tested on four
   different computer/compiler/operating system combinations and works
   on all four with little or no modification.

   Computer:      Compiler:                Operating System:
   ----------     --------------------     -----------------
   80286          Microsoft Fortran 5.0    MSDOS
   80386/80486    Microsoft PowerStation   MSDOS Extended
   VAX 11/780     f77                      unix
   IBM 4381       FORTVS                   CMS

5. Date= January 1995

6. Programmer= John H. Wolfe

B. Purpose

This program is a tool for the user seeking maximum likelihood
estimates of the parameters of a mixture of multivariate normal
distributions.  The program solves the equations for maximum
likelihood, using an iterative algorithm.  Because mixture problems
usually have multiple relative maxima, the program will produce
different results, depending on the initial estimates supplied by the
user.  One should run a variety of other clustering programs first, and
then use their results as initial estimates for NORMIX.  This procedure
has two primary benefits:

1. It will evaluate different solutions produced by other clustering
   programs by computing their likelihood.

2. Given a solution from another clustering program, it will proceed to
   generate a "better" solution with greater likelihood.

If the user does not input his own initial estimates, a default
hierarchical grouping procedure will generate initial estimates for the
iterative algorithm.  It is not recommended that the user rely solely
on this default option.

An option permits the user to specify whether the covariance matrices
within types will be the same or different.  The unequal covariance
option requires larger sample sizes for reliable results, in most cases.
Also, the unequal covariance case has many singularities in the
likelihood function.  Fortunately, Peters and Walker (1978) showed that
almost surely the likelihood function has a unique maximum within a
neighborhood of the population parameters of the mixture for
sufficiently large N.

This program estimates more parameters than clustering procedures which
assume a Euclidean distance.  As in multiple regression, the more
parameters that are estimated, the larger the sample size should be.
If a common covariance matrix is assumed, a good sample size might be
20 times the number of variables.  If the clusters have different
covariance matrices, a good sample size might be 20 times the number of
clusters times the number of variables.

References.

Wolfe, John H. (1970). Pattern clustering by multivariate mixture
   analysis. Multivariate Behavioral Research, 5, 329-350.

Wolfe, John H. (1978). Comparative cluster analysis of patterns of
   vocational interest. Multivariate Behavioral Research, 13, 33-44.

Peters, B. C. & Walker, H. F. (1978). An iterative procedure for
   obtaining maximum-likelihood estimates of the parameters for a
   mixture of normal distributions. SIAM J. Appl. Math., 35, 362-378.

C. Usage

1. Storage requirements= variable, depending on the data.  The
   requirements are displayed on the screen, prior to execution of the
   analysis.  On the UNIX or CMS systems, storage may be increased by
   recompiling the main program (NORMIX.FOR) with a larger dimension
   for the A array and for the DATA statement that follows
   DIMENSION A().  This version uses extended memory, as needed, but
   large problems could fail on computers with small memories.

2. Restrictions= the number of types must not exceed 20.  There are no
   fixed limits on the number of variables or sample size.  However,
   all data and arrays must fit into RAM.

3. Environment= The PC executable file PCNORMIX.EXE requires an 80386
   or higher CPU.

4. Error messages= Bad initial estimates or diverging iterations can
   cause strange estimates of the parameters, such as negative
   variances or singular correlation matrices, which result in system
   diagnostics.

5. Time= Observed run times for the four sample problems accompanying
   this program are given below.

   SAMPLE PROBLEM CHARACTERISTICS AND RUN TIMES

   Characteristic          Irismap  Irismix Artificial     SVIB
   ------------------------------------------------------------
   Equal Covariances           Yes       No         No      Yes
   Sample Size                 150      150        225      113
   No. of Variables              4        4          2       22
   No. of Types                3-4      3-4        1-4    13-15
   Hypothesis:Iterations      3:27     3:13       2:21     13:3
                              4:19     4:29       3:27     14:2
                                                  4:28     15:2

                           CPU Seconds
   ------------------------------------------------------------
   IBM 4381                    7.3      8.4       12.4     28.5
   VAX 11/780                 55.6     56.1       86.0    188.8
   80286 (8 MHz)             795.0   1136.0     1731.0   2860.0
   80286+80287 (8 MHz)       157.0    204.0      297.0    517.0
   80386 (33 MHz)            184.6    279.2      385.3    501.1
   80486DX33                   6.9      6.4        9.2     21.2
   80486DX66                   3.8      3.5        4.8     11.8

   On the IBM 360/65, the CPU time in minutes was given by the
   following two formulas:

   Common covariance matrix option:

      Minutes = (1.0e-6)*( 1667*(2*t-7)
                          + m*(2.865*n*v + 4.961*m)
                          + 4.25*t*v**2 + 2.12*t*v**3
                          + 67*n*t
                          + 1.23*n*i*t*(v+2) ) .

   Different covariance matrices option:

      Minutes = (1.0e-6)*( m*(2.865*n*v + 4.961*m)
                          + 4.25*t*v**2
                          + 67*n*t
                          + 0.62*n*i*t*(v+2) ) .

   where
      t = number of types + 1
      v = number of variables
      i = number of iterations (usually 40-80)
      m = number of kmeans
      n = sample size

   Time for the 33MHz 486 DX is estimated as 1/25 of the above.

6. Files= Logical I/O unit numbers are assigned to files as follows:

   15  Input-Form statements (filename specified interactively,
       default: "thisjob")
   11  Input data (filename specified on Input-Form; if blank, data
       are read from unit 15 following the format statement on the
       Input-Form)
   12  Printout of the analysis (filename specified on Input-Form,
       default: "prinout")
    3  "kc3temp"  Scratch unit containing factor scores
    9  "kv9temp"  Scratch unit containing raw data
    4  "discrim"  The discriminant scores
    7  "dumpout"  The parameter estimates if iteration limit is
       reached.  These can be used to continue the analysis by
       including them as initial estimates on a new Input-Form.

7. Input

   Input to the program comes from three sources:

   a. Keyboard: one line containing the file name of the Input-Form.
   b. Input-Form file.
   c. Data file (filename as specified on the Input-Form)

   The program prompts the user for the name of the file containing
   the input-form.  However, by using the batch program normix.bat,
   one can put the input-form file directly on the command line as the
   first argument.  One can also redirect the console output to a
   script file by using an optional second argument.  For example, the
   command

      normix svib script

   would read the input-form from the file "svib.inp" and redirect the
   console output to the file named "script".  For a further example,
   try

      examples

   which runs the batch file "examples.bat" to run all of the sample
   problems supplied with this package.

***Directions for Filling Out the Input-Form****************

The Input-Form is a set of control statements by means of which the
user specifies the dimensions of the data and the options he chooses.
The alphabetic contents of the Input-Form are ignored by the program,
but provide a useful guide to the numerical contents.  The file
"form.inp" accompanying this documentation is a blank template with the
alphabetic contents of the Input-Form printed in.
The user should copy the template file onto another file and edit it,
filling in the appropriate numerical values and other information.
Please note that this form does NOT consist of key words followed by
free-form input, as in SAS or SPSS.  The parameters must be entered in
the exact columns specified.  In the Input-Form layout below, fill in
values where zeros are shown.

************************************************************************
USER=**** DATE USED=24/07/90                                             01
TITLE=***                                                                02
COMMENTS=                                                                03
COMMENTS=                                                                04
NUMBER OF VARIABLES=00 SAMPLE SIZE=0000                                  05
HYPOTHESES FOR NO. OF TYPES=00,00,00,00,00,00,00,00,00,00,00,00,00,      06
DIFFERENT COVARIANCE MATRIX IN EACH TYPE=0 MINIMUM CLUSTER SIZE=000      07
CONTINUE WITH MORE TYPES IF PROBABILITY OF NULL HYPOTHESIS IS BELOW.000  08
NO. OF INITIAL KMEANS GENERATED=000 MAXIMUM HIERARCHY PRINTED=000        09
MAXIMUM ITERATIONS=000 PRINT ITERATION=0                                 10
DATA INPUT FILE NAME=                                                    11
PRINTOUT FILE NAME=                                                      12
DATA FORMAT= ( )                                                         13
INITIAL ESTIMATES, TYPES=00 MEANS ARE READ=0 STD.DEVS=0 CORRELATIONS=0   14
************************************************************************

********Additional Remarks on the Input-Form****************

The single-digit zeros in lines 07 and 10 are logical variables, i.e.,
1 means yes and 0 means no.  Standard default options are invoked when
lines 06 through 12 are left blank or zero, but this usage is not
recommended.

Lines 1-4 will be printed at the top of each page, and the user should
fill them with descriptive comments concerning his particular problem.

In line 5, col 21-22, enter the number of numeric variables to be
analyzed, not including case IDs, if any.

In line 6, enter the hypothesized numbers of types in ascending order.
If all entries are 00, then 01 ...20 will be used.

In line 7, col. 42, the computer will assume equal covariances unless 1
is entered.  If different covariance matrices are to be estimated for
each type, then col. 66-68, line 7 specifies the minimum number of
points a cluster must have for a covariance matrix to be estimated for
it.  If its size falls below this minimum, its covariance matrix will
be re-initialized to the average within-group matrix.

In line 8, the program will proceed to a greater number of hypothesized
types if the (pseudo-) chi-square test is significant for the
likelihood ratio for the current hypothesis/preceding hypothesis.  If
.999 is entered, the program will ignore this cutoff and continue with
the next hypothesis.  WARNING: This significance test is known to be
incorrect.

Col. 34-36 in line 9 is the parameter N in subroutine Kmean and is the
size of the sub-sample used to generate initial estimates.  The default
value is sample size or 2000, whichever is smaller.  If the initial
estimates are input by the user, the hierarchical grouping may be
shortened by inserting a positive value greater than or equal to 10.
(A positive value smaller than 10 will cause the program to hang.)  A
very useful way to generate a variety of initial estimates is to vary
this parameter.

Col. 65-67 in line 9 specifies the number of hierarchical initial
clusters displayed on the .XXXXX skyline diagram.  The default is 30.
Entering 30, or any other number, will produce additional printouts of
the first two iterations of the skyline diagrams, and also a printout
of the cluster means at each stage of the grouping.

If line 10, col. 20-22 is left blank, the program will iterate until
convergence is obtained.  This generally takes 50-100 iterations.  If 1
is entered in col. 42, the results of each iteration will be printed.

Line 11, columns 22-62 is the file name for input.  In MS-DOS and Unix,
this is the actual file name as it appears in the directory listing.
[On the IBM mainframe CMS, this is a 7-character internal file name
which must be related to a corresponding external file name by a CMS
FILEDEF statement preceding invocation of the program.]
When the filename on this line is left blank, the data input defaults
to the same file as the input form.  In this case, the data immediately
follow the format statement on line 13 and precede line 14 specifying
the initial estimates to be read.

Line 12, columns 20-60 is the file name for printout.  In MS-DOS and
Unix, this is the actual file name as it will appear in the directory
listing.  [On the IBM mainframe CMS, this is a 7-character internal
file name which must be related to a corresponding external file name
by a CMS FILEDEF statement preceding invocation of the program.]  When
the filename is left blank on this line, the printout defaults to the
filename "prinout".

Line 13 contains a variable format for reading the data.  An option
allows reading case IDs of up to 24 alphanumeric characters, using a
format entry such as A24.  A9 would read a nine-character case ID.  If
a case ID is to be read, it must be the first variable read.  For
example

   (t71,a9,t1,3f5.2/2f5.2)

would read a nine-character case ID from columns 71-79, then would read
three numeric variables from columns 1-15, then two more numeric
variables from the next record in columns 1-10.  If no case IDs are to
be read, the A format is omitted.

***IMPORTANT***  Several runs should be done with a variety of initial
estimates input starting with line 14.  Output from other clustering
programs should be used, if available.  Also, one should do several
runs with different values for the number of KMEANS (line 9, cols
34-36).

Line 14 follows the data and precedes each set of initial estimates
supplied by the user.  In col. 26-27, enter the hypothesized number of
types in the set of initial estimates.  In columns 44, 55, and 70,
enter 1 if initial means are to be read, if initial standard deviations
are to be read, and if correlations are to be read, respectively.
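
To check a data file against the format on line 13 before running
NORMIX, the example format (t71,a9,t1,3f5.2/2f5.2) can be mimicked
outside the program.  The Python sketch below is only an illustration
and is not part of NORMIX; the function names and the sample records
are hypothetical.  It assumes the usual Fortran rule for Fw.d input (a
field with no explicit decimal point has its last d digits taken as the
fraction), treats an all-blank field as zero, and does not handle
exponents or blanks embedded inside a number.

```python
def read_f_field(field, d):
    """Read one Fw.d input field (simplified; see assumptions above)."""
    text = field.strip()
    if not text:
        return 0.0                      # all-blank field is read as zero
    if "." in text:
        return float(text)              # explicit decimal point wins
    return int(text) / 10 ** d          # implied decimal: last d digits


def read_case(record1, record2):
    """Mimic (t71,a9,t1,3f5.2/2f5.2) for one case (two records)."""
    record1 = record1.ljust(79)         # pad short records with blanks
    record2 = record2.ljust(10)
    case_id = record1[70:79]            # t71,a9  -> columns 71-79
    values = [read_f_field(record1[i:i + 5], 2) for i in range(0, 15, 5)]
    values += [read_f_field(record2[i:i + 5], 2) for i in range(0, 10, 5)]
    return case_id, values


if __name__ == "__main__":
    # Hypothetical case: three values on the first record with the ID in
    # columns 71-79, two values (implied decimals) on the second record.
    first = " 5.10 3.50 1.40".ljust(70) + "SETOSA001"
    second = "  020  150"
    print(read_case(first, second))
```

If the values printed by such a check do not match what was intended,
the column positions in the data file or the format on line 13 are the
first things to examine.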
*****************Setup for Initial Estimate *********************

All values are entered in free format in the following order:

   Line 1a    Proportion of population in type 1
   Line 1b    Means of variables in type 1
   Line 2a    Proportion of population in type 2
   Line 2b    Means of variables in type 2
   ...        ...  ...  ...  ...
   ...        ...  ...  ...  ...
   Line ra    Proportion of population in last type
   Line rb    Means of variables in last type
   Line c     Standard Deviations within type
   Line d-1   First row of correlations within type
   Line d-2   Second row of correlations within type
   ...        ...  ...  ...  ...
   Line d-m   Last row of correlations within type

If the option for different covariance matrices is selected in line 7
of the input form, then a set of standard deviations and correlation
Lines (c-d) follows the means Line for each type.  The complete
correlation matrix must be read in, but only the upper half is used by
the program.  Therefore zeros (or any other values) may be used as
place-holders in the lower half of the matrix.

8. Output

--Hierarchical Grouping---
This routine prints three iterations as follows:

   1) Mahalanobis distances from total covariance matrix.
   2) Mahalanobis distances from within-group covariances of 10 groups
      found in step 1.
   3) Mahalanobis distances from within-group covariances of 10 groups
      found in step 2.

   Rank=  line number of the printout
   Item=  individual or object being clustered
   Kmean= the kmean cluster membership of the item; the first item of
          each kmean begins with a - sign.
   Stage= the number of groups remaining after this cluster was merged
          with the preceding cluster.

The right hand side of the page is read columnwise.  Each column
represents a stage in the hierarchical grouping.  Each group begins
with a period and continues with an X.

--Iteration 0 ---
This printout gives the initial estimates used to begin the iterative
solution to the likelihood equations.
These estimates will be the same as the corresponding hierarchical
grouping unless the user includes his own initial estimates.

--Iteration-last--
Gives the converged solution to the likelihood equations.

--Probabilities of type membership--
(self-explanatory)

--Discriminant Functions---
Gives the weights to apply to the raw scores so that discriminant
scores will have maximal discrimination between groups and the identity
matrix for the within-group covariances.

--Cluster Members---
This is a list of case Seq.#s and IDs sorted by cluster.

--Printer Plot---
Each point has a number or letter designating the type for which this
individual has the highest probability of membership.  The first 9
clusters are identified by the digits 1-9.  The next 11 clusters are
identified by the letters A-K.  An * indicates a place where there are
two individuals with different cluster membership.  Points which lie
beyond the boundaries of the graph are projected and plotted at the
boundaries.

--Summary of Likelihood Statistics---
At the end of printout of the last hypothesized number of types is a
summary page giving the log likelihoods for each hypothesis.

[After doing 5-6 runs with different initial estimates, I like to copy
this column into a spreadsheet, one column for each initial estimate,
then create a new column that is the maximum across all initial
estimates.  Create another column that is the difference between the
current maximum and the maximum for the line above.  (The pseudo
chi-square is proportional to this.)  I make my final decision as to
how many clusters there are by plotting these values and taking the one
that is just above the line where the values level off.]

--Notes on Printing and Editing---
The printout assumes a page of 60 lines by 120 characters wide.  To
allow for page numbers and headers, one should use a listing program
that prints 66 lines by 133 characters.  I like to use the Norton
Utilities listing program, lp /133.
Normix produces voluminous output, which most users will want to edit
down.  Many editors cannot handle long text files.  Norton Desktop for
Windows contains a desktop editor that works for long Normix printouts.
There are undoubtedly other editors that could be used.
.......................................................................

D. Trouble Reports:

Results are not guaranteed, and the author assumes no liability for any
problems the user may experience in using this program.  However, I
would be greatly interested in hearing of people's experiences with it.
Please report any questions, problems, or difficulties to

   John H. Wolfe            Internet: wolfe@acm.org
   4310 Hill Street         Telephone: (619) 222-5860
   San Diego, Ca. 92107.

If this address doesn't work, try the membership directories for the
American Statistical Association or the Classification Society.